Sentiment analysis is used to identify the emotions expressed about a given topic: anything from movie reviews and tweets to opinions a particular person posts on social media. It is a powerful tool for analysing current and future trends and opinions. In this report we work with IMDb movie review data. We begin with some data exploration, computing ratios to look for dependencies between variables (in our case, certain words): we try to find out whether some words are strictly connected with negative or positive emotions. To reason about the number of occurrences of a given word we use Zipf's law. We then look for the data pre-processing approach that works best for logistic regression and naive Bayes. Finally, we compare several common ML models in terms of accuracy on the test data and evaluation time.
Let's have a look at the data we are going to work on. The review column contains the text whose sentiment we will analyse, and the label column tells us how each review is classified: zero stands for a negative review, one for a positive one. The first column doesn't give us any information, so we will drop it.
The first step after downloading the data is cleaning. We cleaned the data using two different approaches. The first one is less strict: we just lowercased all letters and got rid of commas and the possessive "'s" at the end of words; the result can be found in the column "text". In the second one, beyond the actions described above, we used lemmatization and stemming to reduce the number of words to analyse; the result can be found in the column "text2". This more advanced approach reduces the number of distinct words in the data but takes much longer to run. We will check in further experiments whether it is worthwhile.
Sometimes whole reviews may be erased by cleaning. This happens when they consist only of emoji or stopwords; in that case we would get NaNs, and those reviews should be excluded from the training and testing data. Luckily this is not the case in our dataset, so no further data cleaning is needed.
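A minimal sketch of the lighter cleaning step and the NaN check, assuming a pandas DataFrame with "review" and "label" columns (the toy reviews below are made up for illustration; this is not the exact code used in the report):

```python
import re
import pandas as pd

def basic_clean(text):
    """The less strict approach: lowercase, strip "'s", drop punctuation."""
    text = text.lower()
    text = re.sub(r"'s\b", "", text)       # "'s" at the end of words
    text = re.sub(r"[^a-z\s]", " ", text)  # commas and other punctuation
    return re.sub(r"\s+", " ", text).strip()

df = pd.DataFrame({
    "review": ["This movie's plot was great!", ",,,", "Terrible, just terrible."],
    "label":  [1, 0, 0],
})
df["text"] = df["review"].apply(basic_clean)

# Reviews that become empty after cleaning (only emoji/punctuation) turn
# into NaNs and should be excluded from the train and test data.
df["text"] = df["text"].replace("", pd.NA)
df = df.dropna(subset=["text"]).reset_index(drop=True)
print(df[["text", "label"]])
```

The stricter variant would additionally apply stopword removal, lemmatization and stemming (e.g. with NLTK) before the NaN check.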
Word clouds are great as a decoration or a headline in presentations, but they don't give us much information in an analysis. Their main idea is to show which words were used mainly in a negative or positive context: the bigger the word in the picture, the higher its frequency in the data.
As said before, word clouds are not the best basis for a verdict, but we can still gain some information from them. Nouns like 'movie' or 'character' occur frequently in both clouds, so they tell us little about the sentiment that stands behind them.
The difference in the shapes of the vectorizer matrices is due to the different text cleaning:
We have about 7,500 reviews. The number of distinct words occurring in the column 'text' is about 45k. With the stricter cleaning used in column 'text2' we have fewer than 31k distinct words.
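The vocabulary sizes can be read off the shapes of the CountVectorizer matrices. A sketch on a made-up two-review corpus (the real counts, about 45k vs. 31k, come from the full dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# "text": lightly cleaned; "text2": stopwords removed and words stemmed.
lightly_cleaned = ["this movie's plot was truly great",
                   "the acting was terrible and the plot was boring"]
strictly_cleaned = ["movi plot truli great",
                    "act terribl plot bore"]

cv = CountVectorizer()
X_light = cv.fit_transform(lightly_cleaned)
light_vocab = len(cv.vocabulary_)

X_strict = cv.fit_transform(strictly_cleaned)
strict_vocab = len(cv.vocabulary_)

# Matrix shape is (number of reviews, number of distinct words).
print(X_light.shape, X_strict.shape)
```

Stricter cleaning merges inflected forms, so the second matrix has fewer columns.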
Now let's have a look at the most frequently occurring words and their sentiment context after the two different cleaning processes. The tables below show the top 10 words for each cleaning approach.
As a comparison, we considered two approaches to CountVectorizer: the first on the less cleaned data, the second with tokenization and removal of stopwords and punctuation. We printed the 10 most commonly used words for each method. At first glance we can notice that every word has almost the same number of negative and positive occurrences, and in the 'non-cleaned' data these words carry no meaning for sentiment analysis at all.
To confirm these insights, we will fit a regression looking for a correlation between the negative and positive frequencies. If we manage to fit the model well, it means that frequent words don't carry any valuable information about review sentiment.
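A sketch of such a fit, assuming hypothetical per-word negative and positive counts (the numbers below are invented; the real ones come from the vectorized reviews). An ordinary least-squares line is fitted and its R-squared computed:

```python
import numpy as np

# Hypothetical per-word counts: occurrences in negative vs. positive reviews.
neg = np.array([5200, 4800, 3100, 2900, 150, 40, 900, 60], dtype=float)
pos = np.array([5100, 4950, 3000, 3050, 30, 520, 880, 610], dtype=float)

# OLS regression of positive frequency on negative frequency.
slope, intercept = np.polyfit(neg, pos, 1)
pred = slope * neg + intercept

# R^2 = 1 - SS_res / SS_tot: close to 1 means the counts move together,
# i.e. frequent words say little about sentiment.
ss_res = np.sum((pos - pred) ** 2)
ss_tot = np.sum((pos - pos.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```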
The plots confirm what we noticed above. The negative frequency of a word is almost the same as its positive frequency, especially when the data is not cleaned. Most of the words are below 10,000 occurrences on the first plot and below 2,000 on the second. On the second plot we can see more points that lean towards occurring mainly as a positive or negative word, so the situation improves when we clean the data.
From the summary we can note that the R-squared statistic is almost 1, which means that our data lies very close to the fitted regression line.
Similar results: R-squared is lower than earlier, but still high.
Zipf's law concerns the occurrences of words in written or spoken language. It states that a word's frequency, compared to the most frequent word, is approximately 1/n, where n is the word's rank in the frequency order. So the second most common word occurs about 1/2 as often as the first, the third about 1/3 as often, and so on. Below we have some fun showing that this law also holds (at least approximately) for our dataset; Wikipedia offers an intuitive explanation.
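A minimal check of the law on a made-up snippet of text (the report does the same on the full review corpus):

```python
from collections import Counter

text = ("the cat sat on the mat the dog sat near the cat "
        "the cat and the dog ran").split()
counts = Counter(text).most_common()
top_freq = counts[0][1]

# Under Zipf's law the n-th most common word occurs ~ top_freq / n times.
for rank, (word, freq) in enumerate(counts[:4], start=1):
    expected = top_freq / rank
    print(f"{rank}. {word!r}: observed {freq}, Zipf predicts {expected:.1f}")
```

Even this tiny sample roughly follows the 1/n pattern; on tens of thousands of reviews the fit is much cleaner, especially on a log scale.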
Red dashed lines represent the exact values of Zipf's function; the blue bars are the frequencies of the words that occur most commonly in the dataset.
Taking the log scale for the frequencies gives us a nice line. To make it more informative, we added the terms that occur in each segment.
The same for the more strictly cleaned data.
To extract as much information as possible we tried different ratios and means. The table below shows the outcome of the different calculations and ratios computed on the negative and positive sentiments.
Plot of the harmonic mean of rate CDF and frequency CDF (for less cleaned data).
If a point is closer to the upper left corner, it is more positive, and if it is closer to the bottom right corner, it is more negative.
Plot of the harmonic mean of rate CDF and frequency CDF (for cleaned data).
In both cases, it has created an interesting, almost symmetrical shape.
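We do not reproduce the exact formulas here, but one plausible construction of this score, as a rough sketch, is: take each word's positive rate (share of its occurrences found in positive reviews) and its total frequency, turn both into empirical CDF values, and combine them with a harmonic mean. The per-word statistics below are made up:

```python
import numpy as np

# Hypothetical per-word stats: total frequency and positive rate.
words = ["great", "awful", "movie", "boring", "love"]
freq = np.array([800, 600, 5000, 400, 900], dtype=float)
rate = np.array([0.90, 0.10, 0.50, 0.15, 0.85])

def ecdf(x):
    """Empirical CDF value for each entry: rank scaled into (0, 1]."""
    ranks = x.argsort().argsort()
    return (ranks + 1) / len(x)

rate_cdf, freq_cdf = ecdf(rate), ecdf(freq)
# Harmonic mean of the two CDFs: high only when both components are high.
score = 2 * rate_cdf * freq_cdf / (rate_cdf + freq_cdf)
for w, s in sorted(zip(words, score), key=lambda t: -t[1]):
    print(f"{w}: {s:.2f}")
```

With this construction, a word scores high only if it is both common and strongly positive, which matches the upper-left vs. lower-right reading of the plots.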
Now the most fun part of the ratio analysis. We created an interactive plot using BokehJS. Each point represents a word whose meaning can be checked by pointing the cursor at it.
After this data exploration, it is time to set the hyperparameters for our model. We will examine which text-cleaning approach is most effective, how many words to use, and which method of data vectorisation is better (CountVectorizer vs. TF-IDF). N-grams will also be taken into account for each method.
We divided our dataset into a train and a test set (in proportions 80:20). In both sets, about 50% of the reviews are negative and 50% positive.
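A stratified split preserves the 50/50 class balance in both parts. A sketch with a toy stand-in for the 7.5k reviews (the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

texts = ["good"] * 50 + ["bad"] * 50   # toy stand-in for the reviews
labels = [1] * 50 + [0] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

print(len(X_train), len(X_test))    # 80:20 split
print(sum(y_train) / len(y_train))  # share of positive reviews in train
```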
A note about the legend: 'full cleaned data' means reviews from the 'text2' column; 'with stop words' refers to reviews from the 'text' column; 'no stop words' is also taken from the 'text' column but, as the name states, with stop words deleted.
We compared the test-set accuracy on the different data: with stop words, without stop words, and fully cleaned. The plot clearly shows that the data without stop words has the highest accuracy. What is more, it makes no difference whether we take 40,000 or 50,000 features.
In the case of NB, the accuracy is highest for the fully cleaned data. The results for the data with stopwords are definitely worse, but the difference between the cleaned data and the data without stop words is not significant. Thus, in further steps, we won't use full cleaning: it isn't worthwhile.
It's clearly visible that there aren't big differences between the methods. We also see a slight decrease above 40,000 features, so there is probably no point in using more than 40k. Using n-grams longer than 3 seems pointless as well, but we will test this hypothesis in the upcoming simulations.
This time, unigrams give definitely worse results, while all the other methods look almost identical. Evidently unigrams and NB do not go well together; for the other n-grams, NB performs almost as well as logistic regression.
The dotted plots correspond to TF-IDF and the line plots to CountVectorizer. TF-IDF definitely works better on our dataset. Once again unigrams give the worst results, whereas 3-grams and 6-grams work best, and again taking all 50k features seems pointless.
For the NB classifier, unigrams look definitely worse in both cases, and all the other methods work almost the same. Thus, taking this plot and the previous one into account, we will only consider 3-gram TF-IDF in the further analysis.
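The comparison behind these plots can be sketched as a grid over vectorizers, n-gram ranges and classifiers. A minimal version on a made-up toy corpus (the real experiment varies max_features up to 50k and n-grams up to 6, and uses the train/test split rather than cross-validation):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the IMDb reviews.
texts = ["great movie loved it", "wonderful acting great plot",
         "terrible film boring plot", "awful movie hated it"] * 10
labels = [1, 1, 0, 0] * 10

results = {}
for vec_cls in (CountVectorizer, TfidfVectorizer):
    for ngram in ((1, 1), (1, 3)):
        for clf in (LogisticRegression(max_iter=1000), MultinomialNB()):
            pipe = make_pipeline(
                vec_cls(ngram_range=ngram, max_features=40000), clf)
            acc = cross_val_score(pipe, texts, labels, cv=5).mean()
            key = (vec_cls.__name__, ngram, clf.__class__.__name__)
            results[key] = acc
            print(key, round(acc, 2))
```

On real data the differences between the cells of this grid are what the plots above visualise; the toy corpus is trivially separable, so every cell scores near 100%.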
Because logistic regression is fully interpretable, we took a look at the words with the highest (positive) and the lowest (negative) coefficients in the model.
Most of the terms clustered in each sentiment group make sense. The ones the model takes as negative would be classified the same way by a human (in this case, by us), and the same happens for the positive ones. It means that our regression makes sense and will probably work properly on other texts.
When data has many more features (40k+) than samples (about 7.5k), models tend to overfit, and this is exactly the case with our data. Training accuracy is mostly around 100%, which means our models fit the training data too closely, and this may hurt the score on the test set. To avoid this scenario we added some constraints; in particular, we played with the hyperparameter C, which controls the extent to which the model tries to fit the training data. All the results are shown in the tables below: the right one presents models with their default options (only the n-grams and the maximum number of features changed), while the left one presents scores on the same train and test sets but with different hyperparameters.
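The effect of C can be seen directly in the size of the learned weights: smaller C means stronger regularization and smaller coefficients, which counteracts overfitting. A sketch on a made-up toy corpus (C values chosen for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great fun film", "loved this movie", "wonderful story",
         "boring bad film", "hated this movie", "awful story"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

X = TfidfVectorizer().fit_transform(texts)
norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, labels)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: train acc={clf.score(X, labels):.2f}, ||w||={norms[C]:.2f}")
# Smaller C -> stronger regularization -> smaller weights, less overfitting.
```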
To draw a somewhat broader conclusion, we tried a couple of models and compared their accuracy and evaluation time. Most of the models have an accuracy close to or higher than 85%, which is a pleasant surprise, because NLP without neural networks is often dismissed as useless. In this case, standard models give us decent scores in a really short time. Again, logistic regression seems to be one of the best choices: while SVC performs better, it is not as easy to find the most negative or positive terms for that classifier, whereas for logistic regression we did it right above. It is good to know that the hours spent at the maths faculty will not be wasted and that we can create a simple model that we fully understand and that fulfils the requirements of the given task.
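The accuracy-vs-time comparison can be sketched as a simple loop over fitted models; the corpus below is made up, and the exact set of models and their settings in the report may differ:

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["great wonderful film", "loved it superb", "amazing story great",
         "awful boring film", "hated it terrible", "dull story awful"] * 20
labels = [1, 1, 1, 0, 0, 0] * 20
X = TfidfVectorizer().fit_transform(texts)

results = {}
for model in (LogisticRegression(max_iter=1000), MultinomialNB(),
              LinearSVC(), RandomForestClassifier(n_estimators=50)):
    start = time.perf_counter()
    model.fit(X, labels)
    elapsed = time.perf_counter() - start
    name = model.__class__.__name__
    results[name] = (model.score(X, labels), elapsed)
    print(f"{name}: acc={results[name][0]:.2f}, fit time={elapsed:.3f}s")
```

On the real data the accuracy would be measured on the held-out test set rather than on the training data shown here.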
To make things more challenging, we also tried to work on a dataset provided by Stanford University, prepared and described for the following paper. The hard part of this analysis is that we now have 5 classes, where 1 means very negative, 3 neutral and 5 very positive. We tried to find out whether a basic model can still achieve a decent outcome on this more sophisticated data. To interpret whether the models match the particular sentiments, we used a confusion matrix and evaluated each model according to its outcome.
NOTE: all the data and the train-test split were taken from the Stanford website.
Neither the training nor the test set is perfectly balanced in terms of class occurrences, but this shouldn't affect our models' accuracy.
Logistic regression, which is way simpler, gave us about 40%, and NB 39.7%. These models are much easier to interpret, so it is up to us whether we prefer accuracy or explainability. All models struggle with classes that are close to each other, i.e. {1,2} or {4,5}: there is a lot of misclassification in those cases. The neutral class (3) is almost always missed and classified as 2 or 4. It is hard even for a human to tell whether a text is ironic or neutral, so for these classifiers it is simply impossible to make the right decision. It is worth mentioning that some models are very accurate in special cases, e.g. NB is perfect at finding very negative and very positive comments.
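The confusion-matrix reading above can be sketched as follows; the true labels and predictions below are invented to mimic the pattern described (neighbouring classes confused, the neutral class never predicted correctly):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical 5-class labels: 1 = very negative ... 5 = very positive.
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [1, 2, 2, 1, 2, 4, 4, 5, 5, 5]

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
print(cm)

# Accuracy is the diagonal (correct predictions) over the total.
acc = cm.trace() / cm.sum()
print(f"accuracy = {acc:.1%}")
```

Row 3 of the matrix (the neutral class) has an empty diagonal cell: every neutral example is pushed into class 2 or 4, exactly the failure mode observed on the Stanford data.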